Many large web sites host a huge amount of information. This information is encoded and delivered on demand to the user within a web page, which is really just text written in a markup language that a browser can render. We can take this data and analyze it using R in a variety of different ways. Today we’ll cover scraping web tables and interacting with Application Programming Interfaces (APIs).
Many of the tools we’ll cover are quite powerful ways to interact with data stored online and gather it for analysis. Because they’re powerful, you need to be careful with them. In particular, try to request information in a way that will not burden the website owners. What constitutes a burden depends on the website. Large sites like Google, Twitter, and Facebook have protections in place to guard against too many requests, and they have a huge amount of capacity for the requests they do grant. Smaller websites may have neither. Always strive to be minimally intrusive: you’re usually getting this data for free.
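As a minimal sketch of what "minimally intrusive" looks like in code (the urls below are hypothetical placeholders), you can pause between requests with Sys.sleep():

```r
# A sketch of polite scraping: pause between requests so as not to burden the server
# (the urls here are hypothetical placeholders, not real pages)
urls <- paste0("http://www.example.com/page_", 1:3, ".html")
for (u in urls) {
  # page <- read_html(u)     # the actual request would go here
  Sys.sleep(runif(1, 3, 5))  # wait 3-5 seconds between requests
}
```

We'll use this same pattern later when looping over many pages from a single site.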
R can open a web connection via the url() function. Once that connection is established, we can download whatever we’d like.
## Note: for this to work in class, everything needs to be downloaded ahead of time,
## then read from disk using a consistent file.exists() setup
#Web connections: url
# example
r_home = url("http://www.r-project.org/")
r_home
## description class
## "http://www.r-project.org/" "url"
## mode text
## "r" "text"
## opened can read
## "closed" "yes"
## can write
## "no"
# Pulling text from a website using readlines
# url of Moby Dick (project Gutenberg)
moby_url = url("http://www.gutenberg.org/files/2701/2701-h/2701-h.htm")
# reading the content (first 1,500 lines)
moby_dick = readLines(moby_url, n = 1500)
moby_dick[1145:1148]
## [1] " Call me Ishmael. Some years ago—never mind how long precisely—having"
## [2] " little or no money in my purse, and nothing particular to interest me on"
## [3] " shore, I thought I would sail about a little and see the watery part of"
## [4] " the world. It is a way I have of driving off the spleen and regulating the"
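Per the note above about downloading everything ahead of time, one common pattern (sketched here with a hypothetical local file name) is to save the page to disk on the first run and read the local copy thereafter:

```r
# Download-once pattern: only hit the server if a local copy is missing
moby_file <- "moby_dick.htm"  # hypothetical local file name
if (!file.exists(moby_file)) {
  download.file("http://www.gutenberg.org/files/2701/2701-h/2701-h.htm",
                destfile = moby_file)
}
moby_dick <- readLines(moby_file, n = 1500)
```

This keeps repeated runs of your code from repeatedly requesting the same page.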
When we talk about “scraping” a web table, we’re talking about pulling a table that exists on a website and turning it into a usable data frame for analysis. Below, I take the table from http://en.wikipedia.org/wiki/Marathon_world_record_progression for men’s marathon times and plot the change in speed in m/s as a function of the date that the world record was set.
#Scraping web tables
#example
# load rvest (for read_html and html_table)
# url
marathon_wiki = "https://en.wikipedia.org/wiki/Marathon_world_record_progression"
marathon <- read_html(marathon_wiki)%>%html_table()
marathon<-as_tibble(data.frame(marathon[[1]]))
#Convert time to seconds
marathon<-marathon%>%mutate(Time2=hms(as.character(Time)))%>%mutate(Time2=period_to_seconds(Time2))
#Marathons are 42,195 meters long
marathon$speed<-(42195)/marathon$Time2
#Get dates in a usable format
marathon$date<-mdy(marathon$Date)
## Warning: 1 failed to parse.
marathon<-marathon%>%mutate(Nationality=fct_reorder(as.factor(Nationality),-speed,max))
g1<-ggplot(data=marathon,
aes(y=speed,x=date,
color=Nationality,
text=paste(Name,Time))
)
g1<-g1+geom_point()+
xlab("Date")+
ylab("Meters/Second")+
theme_igray()
ggplotly(g1)
Quick Exercise: Repeat the above analysis for the women’s world record progression.
Getting data from a more complex site takes several steps. First, you need to figure out the structure of the url naming. Then you need to use something like SelectorGadget to identify the data. We’ll go over this in class.
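As an example of url naming structure: season pages on basketball-reference (used below) differ only by the year, so a url for any season can be built with paste0():

```r
# Building urls from a consistent naming pattern: the year is the only part that changes
year <- 2017
url <- paste0("http://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
url
## [1] "http://www.basketball-reference.com/leagues/NBA_2017_totals.html"
```

Once you spot the pattern, looping over many pages is just looping over the varying piece.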
Below, I access some player data from the basketball-reference.com website. To get the table in shape, I needed to figure out the “node” associated with each column. This took some time. It turns out that each column of the table lives in a .left or .right element named nth-child(#), where # is a number from 1-30.
I first assign the address of the web page I’m interested in to a variable called url. This makes it simpler to deal with in code.
## Web Page: could be lots of different things.
url<-"http://www.basketball-reference.com/leagues/NBA_2017_totals.html"
The next step is to download that web page onto my computer. I store it in page, using the read_html() function to get it.
page<-read_html(url)
The web page has all of the data we need. The trick is figuring out how it’s organized. To do this I use a tool called SelectorGadget, which provides the name of the element I’m highlighting. I can also type in element names and see what’s highlighted. After some trial and error, I figured out that there are two main elements to the table on this page: .left and .right. .left contains player names and positions, while .right has all of the relevant data for that player. Using the html_nodes() command, I extract player names from .left.
[Figure: SelectorGadget highlighting columns of the basketball-reference totals table (nba_total.png)]
## Get data frame started: fill in player names
my_nodes<-paste0(".left:nth-child(",2,")")
## This code gets the data !
player_name<-page%>%
html_nodes(my_nodes)%>%
html_text()
player_name[1:10]
## [1] "Alex Abrines" "Quincy Acy" "Quincy Acy"
## [4] "Quincy Acy" "Steven Adams" "Arron Afflalo"
## [7] "Alexis Ajinca" "Cole Aldrich" "LaMarcus Aldridge"
## [10] "Lavoy Allen"
Using the same logic, I grab their points from the table.
## Points are in column 30 of the table
my_nodes<-paste0(".right:nth-child(",30,")")
## This code gets the data !
points<-page%>%
html_nodes(my_nodes)%>%
html_text()
points[1:10]
## [1] "406" "222" "13" "209" "905" "515" "207" "105" "1243" "177"
Now I can combine the two into a data frame. Note that I have to drop the last row of player names: the web page had an ad there.
player_name<-player_name[-length(player_name)]
simple_data<-tibble(player_name,points)
Next we can change the points variable to be numeric, and arrange in descending order.
simple_data<-simple_data%>%mutate_at("points",as.numeric)
simple_data<-simple_data%>%arrange(-points)
simple_data
## # A tibble: 595 x 2
## player_name points
## <chr> <dbl>
## 1 Russell Westbrook 2558
## 2 James Harden 2356
## 3 Isaiah Thomas 2199
## 4 Anthony Davis 2099
## 5 Karl-Anthony Towns 2061
## 6 Damian Lillard 2024
## 7 DeMar DeRozan 2020
## 8 Stephen Curry 1999
## 9 LeBron James 1954
## 10 DeMarcus Cousins 1942
## # ... with 585 more rows
Quick Exercise: Add field goals to the above data frame
The code below takes the exact same steps, but expands in two directions. First, it repeats over the years 1993-2017 to get a 25-year dataset. Second, it includes all of the columns in the table from the basketball-reference website.
## Pulling from a "stats" website
#Check: do you really want to get data?
get_nba_data<-FALSE
#initialize full dataset
nba_df_all<-NULL
#What years?
year_list<-1993:2017
if(get_nba_data==TRUE){
## Loop through years
for (year in year_list){
## Web Page: could be lots of different things.
url<-paste0("http://www.basketball-reference.com/leagues/NBA_",year,"_totals.html")
## Get Page
page<-read_html(url)
#select nodes from the table
my_nodes<-paste0(".right:nth-child(",1,")")
## This code gets the data !
player_numbers<-page%>%
html_nodes(my_nodes)%>%
html_text()
#Name
## Get data frame started: fill in player names
my_nodes<-paste0(".left:nth-child(",2,")")
## This code gets the data !
player_name<-page%>%
html_nodes(my_nodes)%>%
html_text()
player_name<-player_name[-length(player_name)]
#Age
#select nodes from the table
my_nodes<-paste0(".right:nth-child(",4,")")
## This code gets the data !
player_age<-page%>%
html_nodes(my_nodes)%>%
html_text()
#Position
#select nodes from the table
my_nodes<-paste0(".center:nth-child(3)")
## This code gets the data !
player_position<-page%>%
html_nodes(my_nodes)%>%
html_text()
#Fingers crossed on this kludge
player_position<-player_position[!player_position=="Pos"]
#select nodes from the table
my_nodes<-paste0(".left:nth-child(5)")
## This code gets the data !
player_team<-page%>%
html_nodes(my_nodes)%>%
html_text()
# stringr::str_extract("[\\w]+[\\s]+[\\w]+[^/t]") #You can use this too: gets text ONLY
## Create Data Frame
nba_df<-tibble(player_name,
player_position,
player_age,
player_team)
## Names from the web page
column_names<-c(1:5,
"games",
"games_started",
"minutes_played",
"fg",
"fg_attempts",
"fg_percent",
"three_pointers",
"three_point_attempts",
"three_point_percent",
"two_pointers",
"two_point_attempts",
"two_point_percent",
"effective_fg_percent",
"free_throws",
"free_throw_attempts",
"free_throw_percent",
"off_rebound",
"def_rebound",
"total_rebound",
"assists",
"steals",
"blocks",
"turnovers",
"fouls",
"pts"
)
## We want columns 6:30 from this page
my_numbers<-6:30
#Repeat over every column in the table
for (i in my_numbers){
## This is the hardest part: figuring out the part of the page we're interested in
my_nodes<-paste0(".right:nth-child(",i,")")
## Take that part, extract it from the page, change it to text
node_list<-page%>%
html_nodes(my_nodes)%>%
html_text()
## Add that column to the data frame
nba_df<-nba_df%>%add_column(node_list)
names(nba_df)[i-1]<-column_names[i]
}# End column loop
## Add year, since we'll be including multiple years
nba_df$year<-year
#Combine into full dataset
nba_df_all<-bind_rows(nba_df_all,nba_df)
# TOS ask for a 3s crawl delay: No Problem! https://www.sports-reference.com/robots.txt
## Wait between 3 and 5 seconds before going back for next season (being polite)
Sys.sleep(runif(1,3,5))
}#end year loop
#Convert numbers back from text to numbers using mutate_at
nba_df_all<-nba_df_all%>%mutate_at(vars(c(column_names[6:30])),as.numeric)
save(nba_df_all,file = "nba.Rdata")
}else{
load("nba.Rdata")
} #End conditional
Points per minute is one of the key measures of an NBA player’s output. Players with points per minute above .6 are considered in the league’s elite. The way that players achieve this has changed over time. In particular, players have relied much more on three-point shooting to achieve this benchmark. The graphic below shows points per minute as a function of the number of three pointers scored in a season. In 2001, top producers like Shaquille O’Neal shot no three pointers. By 2015, almost all of the top producers like LeBron James, Steph Curry and Kevin Durant relied on three-point shots as a large part of their production.
nba_df_all<-nba_df_all%>%mutate(pts_minute=pts/minutes_played)
gg<-ggplot(filter(nba_df_all,pts>1000,year>2000),aes(y=pts_minute,x=three_pointers,
text=player_name,
color=as.factor(year)))
gg<-gg+geom_hline(yintercept=.6,linetype="dashed",alpha=.3)
gg<-gg+geom_point(size=.25,alpha=.75)
gg<-gg+facet_wrap(~year)
gg<-gg+theme(legend.position="none")
ggplotly(gg)
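The 3-second crawl delay honored in the loop above comes from the site's robots.txt file, which states its crawling rules and which you can inspect directly (a quick sketch; the output depends on the site's current rules):

```r
# Reading a site's robots.txt to check its stated crawling rules
robots <- readLines("https://www.sports-reference.com/robots.txt", warn = FALSE)
# show any lines that mention a crawl delay
robots[grepl("crawl-delay", robots, ignore.case = TRUE)]
```

Checking robots.txt before writing a scraping loop is a good habit for any site.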
Many websites have created Application Programming Interfaces (APIs), which allow the user to communicate directly with the website’s underlying database without dealing with the intermediary web content. These have been expanding rapidly and are one of the most exciting areas of development in data access for data science.
Today, we’ll be working with two APIs: Zillow and the American Community Survey from the Census Bureau. Please go to http://www.census.gov/developers/ and click on “Get a Key” to get your census key. Similarly, go to http://www.zillow.com/howto/api/APIOverview.htm to get your key from Zillow.
YOU NEED TO PAY ATTENTION TO TERMS OF USE WHEN USING APIS. DO NOT VIOLATE THESE.
With these keys in hand, we can interact with these various databases. Today, we’re going to take a look at home prices as a function of income and education level for all of the zip codes in Davidson County TN.
The first step is to create a list of all zip codes in Davidson County. We can do this by using another dataset that includes a comprehensive listing of zip codes by county and city.
# Get dataset that matches all zip codes to cities, counties and states. You're Welcome!
load("zipCityCountyMap.Rdata")
#Rename it
names(zipMap2)<-c("state","county","zip","city_name","countyname","statename")
#select zip codes from Davidson county only
ziplist<-zipMap2$zip[zipMap2$countyname=="Davidson County, TN"]
#Get city names
citylist<-zipMap2$city_name[zipMap2$countyname=="Davidson County, TN"]
#Convert to tibble
city_zip<-tibble(ziplist,citylist)
#Give it easy names
names(city_zip)<-c("zip","city")
#Arrange in order
city_zip<-city_zip%>%arrange(as.numeric(zip))
Next, we’ll turn to the American Community Survey. This includes a large number of tables (list available at http://www2.census.gov/acs2011_3yr/summaryfile/ACS_2009-2011_SF_Tech_Doc.pdf) that cover many demographic and other characteristics of the population, down to the level of zip codes. We’ll use the acs package to get two tables for the zip codes we’re interested in: levels of education and income. We’ll turn these tables into two variables: the proportion of the population with incomes above $75,000, and the proportion of the population with at least a bachelor’s degree.
The first step is to get the tables from the ACS.
# Get your own key and save as my_acs_key.txt
my_acs_key<-readLines("my_acs_key.txt",warn = FALSE)
acs_key<-my_acs_key
# Or just paste it here.
#acs_key<-"<your_acs_key_here>"
#List of tables: https://www.census.gov/programs-surveys/acs/technical-documentation/summary-file-documentation.html under, 1-year appendices
# b15002: education of pop over 25, by sex
# b19001: household income over last 12 months
get_acs_data<-FALSE
if (get_acs_data==TRUE){
api.key.install(acs_key, file = "key.rda")
davidson.zip<-geo.make(zip.code=ziplist)
davidson.educ=acs.fetch(geography=davidson.zip,
endyear=2014,
table.number="B15002", col.names="pretty")
acs.colnames(davidson.educ)
## Proportion of individuals at college or above=
## number with college degree/
## total number
prop.coll.above<-divide.acs(numerator=(davidson.educ[,15]+
davidson.educ[,16]+
davidson.educ[,17]+
davidson.educ[,18]+
davidson.educ[,32]+
davidson.educ[,33]+
davidson.educ[,34]+
davidson.educ[,35]),
denominator=davidson.educ[,1]
)
# B19001: family income
davidson.inc<-acs.fetch(geography=davidson.zip,
endyear = 2014,
table.number="B19001", col.names="pretty")
acs.colnames(davidson.inc)
#Proportion above 75k--
prop.above.75<-divide.acs(numerator=(davidson.inc[,13]+
davidson.inc[,14]+
davidson.inc[,15]+
davidson.inc[,16]+
davidson.inc[,17]),
denominator=davidson.inc[,1]
)
# Convert to tibble
dav_df<-tibble(substr(geography(davidson.educ)[[1]],7,11),
as.numeric(estimate(prop.coll.above)),
as.numeric(estimate(prop.above.75))
)
# Give it easy to use names
names(dav_df)<-c("zip","college_educ","income_75")
save(dav_df,file="dav.RData")
}else{
load("dav.RData")
}
#View(dav_df)
Quick Exercise: Pull table B23001, “Sex by Age by Employment Status for the Population 16 and over,” from the ACS.
Now, we’ll turn to the Zillow API, which we’ll use to get median home prices and price per square foot in these zip codes. Notice that I don’t download the dataset if it already exists: this avoids making too many calls to Zillow, which won’t take more than 1,000 a day (and that actually isn’t a lot).
We’re going to interact directly with the Zillow API, as opposed to using an R “wrapper” like the acs package used above. This means we have to format the request properly AND be able to interpret the results when they come back.
# Get your own key and save it as zillow_key.txt
zwsid<-readLines("zillow_key.txt",warn=FALSE)
#Or just paste it here
#zwsid<-"<your_zillow_key_here"
# List of zips for which XML file with Zillow demographic data is to be extracted
get_zillow_data=FALSE
if (get_zillow_data==TRUE){
#Create a blank data frame
zdemodata<- list(zip = list(as.character()),
medListPrice=list(as.numeric()),
medValSqFt=list(as.numeric())
)
# Loop over each zip code, requesting and parsing Zillow demographic data
#repeat across all zip codes
for (i in 1:length(ziplist)) {
# url needs zillow key and zip code
url=paste0("http://www.zillow.com/webservice/GetDemographics.htm?zws-id=",zwsid,"&zip=",ziplist[i])
#parse result
x=xmlInternalTreeParse(url)
#Add zip code to null dataset
zdemodata$zip[i]=ziplist[i]
#Look for right element in xml file, pull it
x2=xpathApply(x,"//table[name = 'Affordability Data']/data/attribute[name = 'Median List Price']/values/zip/value",xmlValue)
# add that element to the dataset
zdemodata$medListPrice[i]=x2[[1]][1]
# And again, find element, pull it
x3=xpathApply(x,"//table[name = 'Affordability Data']/data/attribute[name = 'Median Value Per Sq Ft']/values/zip/value",xmlValue)
# add element to dataset
zdemodata$medValSqFt[i]=x3[[1]][1]
}
#Convert to tibble
zdemodata2<-data.frame(
as.character(unlist(zdemodata$zip)),
as.character(unlist(zdemodata$medListPrice)),
as.character(unlist(zdemodata$medValSqFt))
)
names(zdemodata2)<-c("zip","medListPrice","medValSqFt")
zdemodata2<-as_tibble(zdemodata2)
#Factors must go through character first (converting a factor straight to numeric gives level codes)
zdemodata2<-zdemodata2%>%mutate_at(c("medListPrice","medValSqFt"),as.character)
#Change relevant variables to numeric
zdemodata2<-zdemodata2%>%mutate_at(c("medListPrice","medValSqFt"),as.numeric)
save(zdemodata2,file="zdemodata.RData")
}else{
load("zdemodata.RData")
}
head(zdemodata2)
## # A tibble: 6 x 3
## zip medListPrice medValSqFt
## <fctr> <dbl> <dbl>
## 1 37218 55 55
## 2 37076 259900 126
## 3 37205 699900 253
## 4 37209 389900 220
## 5 37013 220000 117
## 6 37122 369000 137
Quick Exercise: Get median 4-bedroom home value and add it to the above data frame
Now we merge these two datasets by zip code:
#Merge two datasets
dav_house_df<-left_join(zdemodata2,dav_df,by="zip")
## Warning: Column `zip` joining factor and character vector, coercing into
## character vector
dav_house_df<-left_join(dav_house_df,city_zip,by="zip")
First, we can show values of homes (per square foot) by zip code.
#Need to order zip codes for barplot
dav_house_df<-dav_house_df%>%mutate(zipfactor=factor(as.character(zip)))
dav_house_df<-dav_house_df%>%mutate(zipfactor=fct_reorder(zipfactor,medValSqFt))
#Bar plot of value by zip code
g6<-ggplot(data=filter(dav_house_df,medValSqFt>55),
aes(x=zipfactor,y=medValSqFt,fill=medValSqFt,text=city))
g6<-g6+scale_fill_gradient(low="blue",high="red")
g6<-g6+geom_bar(stat="identity")+coord_flip()
g6<-g6+xlab("")+ylab("Median Value per Square Foot, Nashville Neighborhoods")
g6<-g6+theme(legend.position="none")
ggplotly(g6)
We can run a regression predicting value per square foot by income and education:
#Predict square footage by income and education
mod1<-lm(medValSqFt~college_educ+income_75,data=dav_house_df);summary(mod1)
##
## Call:
## lm(formula = medValSqFt ~ college_educ + income_75, data = dav_house_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.840 -50.624 -3.331 37.704 210.363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.39 32.85 1.869 0.0706 .
## college_educ 65.21 75.35 0.865 0.3931
## income_75 178.87 91.33 1.959 0.0587 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.11 on 33 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.1968, Adjusted R-squared: 0.1481
## F-statistic: 4.043 on 2 and 33 DF, p-value: 0.02688
Quick Exercise: Predict total home price using the same factors.
We can also create a scatterplot that shows median value per square foot as a function of education.
#Plot
g2<-ggplot(data=filter(dav_house_df,medListPrice>0,medValSqFt>55),
aes(x=college_educ,y=medValSqFt))
g2<-g2+geom_point()
g2<-g2+xlab("Proportion of Population With a College Degree")+ylab("Median Value per Square Foot")
g2<-g2+geom_smooth(method=lm)
g2<-g2+theme_igray()
g2
Quick Exercise: Create a scatterplot with income as the x variable.
The last set of code maps these zip codes. Using the zctas() function from the tigris package, I pull the “shapes” of the zip codes in Davidson County. I then overlay those zip code boundaries on a base map with leaflet, filling them in by home value, list price, and education level.
## FIX: need to do this the sf way using the tigris package
#Just because: mapping this
zip_shapes<-zctas(cb=TRUE, #simpler and smaller
starts_with = "37"
)
zip_shapes<-zip_shapes[zip_shapes$ZCTA5CE10%in%dav_house_df$zip,]
zip_shapes$zip<-zip_shapes$ZCTA5CE10
#dav_house_df<-as.data.frame(dav_house_df)
#dav_house_df$zip<-as.character(dav_house_df$zip)
davidson_map<-geo_join(zip_shapes,
dav_house_df,
"zip","zip")
popup <- paste0("Zip: ", davidson_map$zip,
"<br>",
davidson_map$city,
"<br>",
"Median Value per Square Foot: ",
round(davidson_map$medValSqFt,2),
"<br>",
"Percent of Pop with A Bachelors:",
round(davidson_map$college_educ,2),
"<br>",
"Median List Price:",
prettyNum(davidson_map$medListPrice,big.mark = ",")
)
#Palette for Square
sqftpal <- colorNumeric(
palette = "YlGnBu",
domain = davidson_map$medValSqFt
)
price_pal <- colorNumeric(
palette = "YlGnBu",
domain = davidson_map$medListPrice
)
educ_pal <- colorNumeric(
palette = "YlGnBu",
domain = davidson_map$college_educ
)
map3<-leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = davidson_map,
fillColor = ~sqftpal(medValSqFt),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup,
group="Price/ Sq Ft"
) %>%
addPolygons(data = davidson_map,
fillColor = ~price_pal(medListPrice),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup,
group="Home Price"
) %>%
addPolygons(data = davidson_map,
fillColor = ~educ_pal(college_educ),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup,
group="Education"
) %>%
addLayersControl(
baseGroups = c("Price/ Sq Ft"),
overlayGroups = c("Education", "Price/ Sq Ft","Home Price"),
options = layersControlOptions(collapsed = FALSE)
)
map3